Skip to content

BDMS-576: Add transfer accountability tooling, duplicate WellData comparison, transfer hardening/perf, and nullable schema updates#537

Merged
jirhiker merged 14 commits intostagingfrom
BDMS-576-Verify-that-all-data-associated-with-relevant-PointIDs-was-transferred-from-SQL-to-Postgres-successfully
Feb 23, 2026
Merged

BDMS-576: Add transfer accountability tooling, duplicate WellData comparison, transfer hardening/perf, and nullable schema updates#537
jirhiker merged 14 commits intostagingfrom
BDMS-576-Verify-that-all-data-associated-with-relevant-PointIDs-was-transferred-from-SQL-to-Postgres-successfully

Conversation

@jirhiker
Copy link
Copy Markdown
Member

@jirhiker jirhiker commented Feb 22, 2026

Use this updated PR description (adds the recent work):

Summary

This PR delivers a broad transfer-quality and operability update against staging, including:

  1. Transfer comparison/reporting framework
  2. New CLI tooling for transfer diagnostics (duplicated WellData + well smoke test)
  3. Transfer runtime/performance improvements across high-volume paths
  4. Constraint relaxations + schema/model alignment for nullable fields
  5. Contact transfer robustness improvements (email normalization/fallback handling + idempotent enrichment)
  6. Logging separation for CLI vs transfer runs
  7. New directory-level READMEs for maintainability
  8. End-to-end smoke-test parity fixes (source/destination normalization and transfer-faithful contact simulation)

What Changed

1. Transfer results / comparison framework

Added a full comparison/reporting subsystem to evaluate source-vs-destination transfer outcomes:

  • transfers/transfer_results_builder.py (new)
  • transfers/transfer_results_specs.py (new)
  • transfers/transfer_results_types.py (new)
  • cli/cli.py: transfer-results command

This produces summary artifacts for transfer verification and agreed/missing/extra analysis.


2. New CLI diagnostics

Enhanced cli/cli.py with first-class commands:

  • transfer-results
  • compare-duplicated-welldata (new)
    • Finds duplicated WellData.PointID rows
    • Exports:
      • summary CSV (per duplicate PointID)
      • detail CSV (row × differing column values)
    • Supports optional --pointid filters
    • Supports optional transfer-like prefilters
  • well-smoke-test (new, first-class)
    • Random/sample-based or full-population smoke checks across related entities
    • New --all-wells/--sampled mode to run entire selected population (e.g. all agreed wells)
    • CSV/JSON outputs under transfers/metrics

3. Transfer logging split

Separated CLI logs from transfer-script logs:

  • transfers/logger.py supports context-aware log routing via OCO_LOG_CONTEXT
  • cli/cli.py sets OCO_LOG_CONTEXT=cli

Result:

  • CLI logs -> cli/logs/cli_*.log
  • transfer logs -> transfers/logs/transfer_*.log
  • bucket folders split (cli_logs vs transfer_logs)

4. Nullable constraints + migration chain updates

Schema/migration updates to support real-world nulls seen during transfer:

New migrations:

  • 8c9d0e1f2a3b_make_measuring_point_height_nullable.py
  • 9a0b1c2d3e4f_make_address_postal_code_nullable.py
  • a1b2c3d4e5f7_make_deployment_installation_date_nullable.py
  • b3c4d5e6f7a8_make_wellscreen_depths_nullable.py
  • c4d5e6f7a8b9_make_address_city_state_nullable.py

Also:

  • removed merge migration: 43bc34504ee6_merge_migrations_after_staging_merge.py
  • adjusted migration linkage in 50d1c2a3b4c5_add_unique_index_ngwmn_wellconstruction.py

Model/schema alignment:

  • db/contact.py, schemas/contact.py (city/state/postal_code nullable handling)
  • db/deployment.py, schemas/deployment.py
  • db/measuring_point_history.py, schemas/thing.py, db/thing.py

5. Contact transfer robustness

transfers/contact_transfer.py updates include:

  • Email normalization:
    • strips Email: prefix
    • trims trailing punctuation (e.g. user@x.com.)
  • Phone-in-email-field handling:
    • phone-like values in email fields are routed through phone parsing instead of hard-failing as email
  • Dedupe/cache improvements
  • New idempotent enrichment behavior for reused contacts:
    • reused contacts now receive missing phones/emails/addresses (instead of early-return skip)
    • prevents missed address backfills on shared contacts across wells
  • Association idempotency:
    • avoids duplicate ThingContactAssociation rows on reruns

Added unit test:

  • tests/unit/test_contact_transfer_email_utils.py

6. Transfer performance/hardening updates

Improvements across transfer modules for throughput and reliability, including updates in:

  • transfers/geologic_formation_transfer.py
  • transfers/link_ids_transfer.py
  • transfers/thing_transfer.py
  • transfers/sensor_transfer.py
  • transfers/surface_water_data.py
  • transfers/waterlevels_transfer.py
  • transfers/well_transfer.py
  • transfers/util.py
  • transfers/transfer.py
  • transfers/transferer.py

(Shift toward batch/core insert paths and reduced per-row overhead where applicable.)


7. Documentation additions

Added directory READMEs:

  • api/README.md
  • cli/README.md
  • db/README.md
  • transfers/README.md
  • tests/README.md
  • transfers/relaxed_constraints.md

8. Smoke-test parity fixes (recent)

transfers/smoke_test.py was updated to match transfer behavior and remove false positives:

  • Organization normalization parity with transfer mapper/lexicon validation
  • Contact simulation now mirrors transfer dedupe + fallback semantics
  • Supports incomplete-phone parity (IncompleteNMAPhone)
  • Contact name normalization collapses repeated whitespace
  • Deployment source signature parity:
    • parses legacy dates
    • estimates missing install dates using same estimator path used by transfer
  • Presence status for value-compared entities derived from normalized value sets (not raw count artifacts from shared-contact reuse)

Validation / Verification Performed

  • CLI wiring validated for:
    • transfer-results
    • compare-duplicated-welldata
    • well-smoke-test (including --all-wells)
  • Targeted compile checks on touched Python modules
  • Spot checks on duplicate WellData comparison output
  • Log inspections confirmed CLI/transfer log split
  • Smoke test run on full agreed population:
    • sampled_wells=9887
    • presence_mismatches=0
    • value_mismatches=0
    • failed_wells=0

Note: full pytest suite is environment-dependent (DB-backed) and was not fully executable in sandboxed context.


Migration Notes

Before running transfers with this branch:

source .venv/bin/activate
alembic upgrade head

Risk / Impact

  • Transfer behavior changes in contact dedupe/enrichment, nullable handling, and email/phone parsing are intentional.
  • Schema changes are additive/relaxing (nullable), but downstream non-null assumptions should be reviewed.
  • Operational improvements from diagnostics and log separation are expected.
  • Smoke output is now materially more trustworthy (reduced false positives from transfer-semantic drift).

Suggested Reviewer Focus

  1. transfers/transfer_results_specs.py agreed/missing semantics
  2. transfers/contact_transfer.py dedupe + idempotent enrichment logic for reused contacts
  3. migration chain/order and downgrade expectations
  4. well-smoke-test parity assumptions in transfers/smoke_test.py
  5. duplicate WellData comparison command output format and utility

Copilot AI review requested due to automatic review settings February 22, 2026 21:25
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR implements extensive improvements to data transfer verification and performance optimizations for the BDMS-576 project. The PR focuses on verifying successful data migration from SQL Server to PostgreSQL and includes significant performance enhancements to transfer scripts, along with schema relaxations to accommodate legacy data quality issues.

Changes:

  • Added comprehensive transfer verification framework with TransferResult types, comparison specs, and a builder for generating transfer summary reports
  • Implemented batch insert optimizations across multiple transfer modules (thing_transfer, geologic_formation, link_ids, contact_transfer) to improve performance
  • Relaxed database constraints (measuring_point_height, deployment.installation_date, well_screen depths, address fields) to nullable to accommodate incomplete legacy data
  • Enhanced data quality handling with email/phone field detection, measuring point height rounding, and improved contact deduplication

Reviewed changes

Copilot reviewed 44 out of 44 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
transfers/well_transfer.py Added progress tracking with transferred_count for parallel wells transfer
transfers/waterlevels_transfer.py Changed logging levels, renamed stats field to rows_missing_participants, continued processing without field_event_participants
transfers/util.py Added _round_sig_figs function to round estimated measuring point heights to 2 significant figures
transfers/transferer.py Removed error logging for missing sample_pt_id matches
transfers/transfer_results_types.py New file defining TransferResult dataclasses for verification
transfers/transfer_results_specs.py New file with 29 transfer comparison specs and filter functions
transfers/transfer_results_builder.py New file implementing TransferResultsBuilder for comparing source CSV to destination DB
transfers/transfer.py Made bucket upload conditional on SAVE_TO_BUCKET env variable
transfers/thing_transfer.py Refactored to batch insert pattern with bulk Location/Thing/Notes/DataProvenance inserts
transfers/surface_water_data.py Removed null thing_id checks, allowing nullable thing_id
transfers/sensor_transfer.py Changed installation_date logging from critical to warning, allows NULL
transfers/relaxed_constraints.md New documentation file listing 10 relaxed constraints
transfers/logger.py Added OCO_LOG_CONTEXT env support for cli vs transfer log segregation
transfers/link_ids_transfer.py Refactored to batch insert pattern with bulk ThingIdLink inserts
transfers/geologic_formation_transfer.py Refactored to bulk upsert with on_conflict_do_nothing
transfers/contact_transfer.py Added contact caching, phone-in-email detection, email normalization
transfers/README.md New README documenting transfer structure and performance rules
tests/unit/test_contact_transfer_email_utils.py New tests for email normalization and phone detection
tests/test_util.py Added tests for measuring point height rounding
tests/test_cli_commands.py Added test for transfer-results CLI command
tests/features/environment.py Added DROP_AND_REBUILD_DB env guard for test setup/teardown
tests/README.md New README documenting test structure
schemas/thing.py Made measuring_point_height nullable, disabled depth validations for transfer
schemas/sample.py Made field_event_participant_id nullable
schemas/deployment.py Made installation_date nullable
schemas/contact.py Made address city, state, postal_code nullable
pyproject.toml Added transfers to packages list
db/thing.py Made screen_depth_top and screen_depth_bottom nullable
db/measuring_point_history.py Made measuring_point_height nullable
db/deployment.py Made installation_date nullable
db/contact.py Made address city, state, postal_code nullable
db/README.md New README documenting DB structure
core/lexicon.json Added well_construction_method to Unknown term categories
cli/cli.py Added transfer-results and compare-duplicated-welldata commands
cli/README.md New README documenting CLI structure
api/README.md New README documenting API structure
alembic/versions/*.py Added 5 new migrations for nullable changes, deleted 1 merge migration, updated down_revision for 1 migration

@jirhiker jirhiker changed the title Bdms 576 verify that all data associated with relevant point i ds was transferred from sql to postgres successfully BDMS-576: Add transfer accountability tooling, duplicate WellData comparison, transfer hardening/perf, and nullable schema updates Feb 22, 2026
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 41ff8de1ee

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Copilot AI review requested due to automatic review settings February 22, 2026 21:44
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 45 out of 45 changed files in this pull request and generated 14 comments.

return mphs, mph_descs, start_dates, end_dates


def _round_sig_figs(value: float, sig_figs: int) -> float:
Copy link

Copilot AI Feb 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The _round_sig_figs function has a parameter type annotation of value: float but handles None, NaN, non-numeric values, and various edge cases. The type annotation should be value: float | None or value: Any to accurately reflect the accepted input types.

Copilot uses AI. Check for mistakes.
Comment on lines +1 to +10
Address.postal_code is nullable
Thing measuring_point_height is nullable
ValidateWell, depth validation removed
Deployment.installation_date is nullable
CreateWellScreen depth validation removed
FieldEventParticipants not required
screen_depth_bottom is nullable
screen_depth_top is nullable
city nullable
state nullable No newline at end of file
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI @ksmuczynski. All of these constraints should be reenabled after the NMA migration is completed

@ksmuczynski
Copy link
Copy Markdown
Contributor

I agree, in the interest of transferring legacy data successfully and in its entirety, allowing NULL measuring point heights in the MeasuringPointHistory table is acceptable for now. Approved!

Copy link
Copy Markdown
Contributor

@marissafichera marissafichera left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks bueno. Thanks!

Copilot AI review requested due to automatic review settings February 23, 2026 22:15
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 47 out of 48 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (1)

schemas/thing.py:57

  • ValidateWell.validate_values() now returns early, leaving the depth-validation logic below permanently unreachable. If the intent is a temporary transfer-only relaxation, consider guarding the validation behind a feature flag/env var (or a schema option) and removing the dead code path to avoid confusion and accidental reintroduction issues later.
    @model_validator(mode="after")
    def validate_values(self):
        # todo: reenable depth validation. removed for transfer
        return self

        if self.hole_depth is not None:
            if self.well_depth is not None and self.well_depth > self.hole_depth:
                raise ValueError(
                    "well depth must be less than than or equal to hole depth"
                )

Comment on lines +615 to +632
if contacts_to_create:
try:
created_contact_ids = (
session.execute(
insert(Contact).returning(Contact.id),
contacts_to_create,
)

self._created_contacts[(name, organization)] = contact
.scalars()
.all()
)
except Exception as e:
logger.critical(
"Contact insert failed for PointID=%s, GlobalID=%s: %s",
row.PointID,
row.GlobalID,
str(e),
)
else:
Copy link

Copilot AI Feb 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In _get_field_event_participant_ids, if the bulk insert(Contact) fails, the exception is logged but the session is not rolled back. After a DB/SQLAlchemy execution error, the session is typically left in a failed transaction state and later inserts in the same group will raise PendingRollbackError. Roll back (and ideally capture an error) in the except branch, and consider returning an empty participant list for that row/group so downstream inserts can continue safely.

Copilot uses AI. Check for mistakes.
Comment on lines +207 to +216
def _transfer_hook(self, session: Session):
self._build_contact_caches(session)

groups = self._get_group()
pointids = [
idx[0] if isinstance(idx, tuple) else idx for idx in groups.groups.keys()
]
things = session.query(Thing).filter(Thing.name.in_(pointids)).all()
thing_by_name = {thing.name: thing for thing in things}
logger.info(
Copy link

Copilot AI Feb 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_build_contact_caches() stores live Contact ORM objects for reuse, but _group_step can still session.rollback() on exceptions (later in the flow). A rollback can discard newly-created (uncommitted) contacts and leave these cached ORM instances out of sync with what’s actually persisted, which can break dedupe/association logic in subsequent iterations. To keep caches consistent, consider isolating per-row/per-group work in a nested transaction, or clearing/rebuilding the caches after any rollback.

Copilot uses AI. Check for mistakes.
@jirhiker jirhiker merged commit 1ef94a9 into staging Feb 23, 2026
12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants